2023-07-25

Introduction

Exploratory Data Analysis of daily air quality in New York, from May to September, 1973
Code book: as per the help page of the airquality data set from the R datasets package

  • Ozone - Mean ozone in parts per billion (ppb) from 1 to 3 pm at Roosevelt Island
  • Solar.R - Solar radiation in Langleys (lang) in the frequency band 4000-7700 Angstroms from 8 am to 12 noon at Central Park
  • Wind - Average wind speed in miles per hour (mph) at 7 and 10 am at LaGuardia Airport
  • Temp - Maximum daily temperature in degrees Fahrenheit (°F) at LaGuardia Airport
  • Month - Month for the measurement date
  • Day - Day of the month for the measurement date
# datasets ships with base R, so it never needs to be installed separately
library(datasets)
data("airquality")
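
As a quick check against the code book, the structure of the data set can be inspected once it is loaded. This is a minimal sketch; the names n_obs, n_vars and month_range are illustrative and do not appear in the original report.

```r
library(datasets)
data("airquality")

# Column names and types, to compare against the code book
str(airquality)

# Number of daily observations and number of variables
n_obs  <- nrow(airquality)
n_vars <- ncol(airquality)

# Range of months covered (5 = May, 9 = September)
month_range <- range(airquality$Month)
print(c(rows = n_obs, columns = n_vars))
print(month_range)
```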

Data Processing

27.45% of rows have at least one missing value. The missing values show a discernible pattern, as most of them fall in the month of June, so they are treated as missing at random (MAR).
They are imputed with the impute.knn function from the impute package, using k = 5 nearest neighbours and the RNG seed set to 325.
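
The share of incomplete rows and its monthly pattern can be checked with a few lines of base R. This is a sketch under the assumption that "rows with missing values" means rows with at least one NA in any column; incomplete, pct_incomplete and na_by_month are illustrative names.

```r
library(datasets)
data("airquality")

# TRUE for every row that has at least one missing value
incomplete <- !complete.cases(airquality)

# Percentage of incomplete rows, rounded to two decimals
pct_incomplete <- round(100 * mean(incomplete), 2)
print(pct_incomplete)

# Number of missing values per column
print(colSums(is.na(airquality)))

# Number of incomplete rows per month (5 = May, ..., 9 = September)
na_by_month <- table(airquality$Month[incomplete])
print(na_by_month)
```

A month that dominates the last table supports the statement that the missingness is not spread evenly over the season.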

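# impute is distributed through Bioconductor rather than CRAN
if(system.file(package = "BiocManager") == "") install.packages("BiocManager")
if(system.file(package = "impute") == "") BiocManager::install("impute")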
library(impute)
dat <- as.data.frame(impute.knn(as.matrix(airquality),k=5,rng.seed = 325)$data)
# Round the imputed ozone values to whole numbers, since ozone is recorded in whole ppb
dat$Ozone <- round(dat$Ozone)

The data set now has no missing values: the proportion of complete cases, mean(complete.cases(dat)), is 1.

Trend of Ozone vs Solar Radiation in NY

Trend of Ozone vs Wind in NY

Trend of Ozone vs Temperature in NY
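
The three trend plots named in the headings above could be drawn along the following lines. This is only a sketch: the report lists ggplot2 and plotly among its packages, whereas this example substitutes base R graphics with a lowess smoother so that it runs without extra packages, and it uses the observed (non-imputed) rows rather than the imputed dat. The names obs and predictors are illustrative.

```r
library(datasets)
data("airquality")

# Assumption: use only rows where all variables were observed
obs <- airquality[complete.cases(airquality), ]

# Predictors to plot against ozone, with axis labels taken from the code book
predictors <- c(Solar.R = "Solar radiation (lang)",
                Wind    = "Wind speed (mph)",
                Temp    = "Maximum temperature (deg F)")

# One scatter plot per predictor, coloured by month, with a lowess trend line
op <- par(mfrow = c(1, 3))
for (v in names(predictors)) {
  plot(obs[[v]], obs$Ozone,
       col = obs$Month - 4, pch = 19,   # months 5..9 mapped to colours 1..5
       xlab = predictors[[v]], ylab = "Ozone (ppb)",
       main = paste("Ozone vs", v))
  lines(lowess(obs[[v]], obs$Ozone), lwd = 2)
}
par(op)
```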

Conclusions

  • Ozone levels appear to have a stronger association with solar radiation, and both tend to increase during the July-September period
  • Ozone levels appear to have a weak association with the temperature of the day, and both tend to increase during the July-September period
  • There seems to be no association between ozone levels and wind speed, and wind speeds remain fairly similar across the months

All of these points still need to be confirmed with statistical tests, for example Student's t-test or a linear regression model.
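
One possible way to run such a check is sketched below. It is not the analysis performed in this report: it assumes Pearson correlation (whose significance test is based on Student's t distribution) and an ordinary least squares model, and it uses the observed rows of airquality rather than the imputed dat so that it runs without the impute package. The names obs, cors, ct and fit are illustrative.

```r
library(datasets)
data("airquality")

# Assumption: restrict the check to rows without missing values
obs <- airquality[complete.cases(airquality), ]

# Pearson correlation of ozone with each candidate predictor
cors <- sapply(c("Solar.R", "Wind", "Temp"),
               function(v) cor(obs$Ozone, obs[[v]]))
print(round(cors, 2))

# Significance test for a single pair; the test statistic follows a t distribution
ct <- cor.test(obs$Ozone, obs$Temp)
print(ct)

# Linear regression of ozone on all three predictors together
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = obs)
print(summary(fit))
```

Comparing the size and sign of the correlations and regression coefficients with the visual impressions listed above would show which of the conclusions hold up.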

Appendix

Source

The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data)

References

Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.

R markdown details

Written as an R Markdown file in R version 4.3.1 (2023-06-16 ucrt) using the RStudio IDE
Packages used:

  • datasets : Version 4.3.1
  • impute : Version 1.74.1
  • ggplot2 : Version 3.4.2
  • plotly : Version 4.10.2